Research Reproducibility

Brian Naughton | Wed 10 September 2014 | outsourcing | experiments reproducible research

It turns out that most research is not terribly reproducible, and that's a problem when you are trying to turn R into D, as Amgen and others have discovered. That's why reproducible research is currently a hot topic in science — so hot you can even take a course on it on Coursera.

Most reproducible research literature I've seen focuses on the data analysis component, which makes sense since this is the most straightforward place to start. Generally, this means that you package your code and data in such a way that others can rerun your experiment.

Robotic labs may also help with reproducibility. What is a robotic lab? To me, a robotic lab is a lab that is automated enough that protocols are defined in a machine-readable language. No research lab looks like this yet (with good reason), but some commercial labs (Transcriptic and Emerald Cloud Lab) are moving quickly in this direction. It's an awesome trend that could lead to a lot of changes in how science is done, including the reproducibility of research.

I'll ignore some of the current problems with robots, which are mainly around manual dexterity, skill and flexibility. For many experiments, these limitations will be too much to overcome, but historically, automation and mass production eventually beats manual labor, starting with the most boring and repeatable tasks.

The potential advantages of robotic labs are highly analogous to those of cloud computing:

high scalability
low cost (thanks to economies of scale and automation)
protocols can be transferred from lab to lab
reproducibility within and between labs

Research Papers

The current model of writing research papers dates back to Robert Boyle in the 17th century (at least according to a recent episode of In Our Time). A 21st century research paper could look pretty different:

Introduction: currently, these are usually just derivative summaries, so let's replace this with a wikipedia link or one of a standardized set of introductions: "#34, why cancer research matters"
Methods: this is now a machine-readable protocol written in the Wolfram Language or YAML or Python
Results: analysis and results are in an iPython notebook so you can see exactly what was done with the data and in what order
Conclusion: here we can go to town describing in free-text what happened in the experiment and why it matters

Maybe I am being too harsh on the introduction, but in general this model makes a lot of sense to me. (I recently learned that there is a company, Standard Analytics, trying to help people write papers this way).

I would guess that currently, most papers cannot be written this way, because the experimental technique is too temperamental to standardize, or the code to analyze the data is in flux, or the analysis only lives in an Excel spreadsheet, or a hundred other reasons. However, it's also possible you just shouldn't read these papers on the grounds that they are likely to contain errors.

Conclusion

It's unfortunate that even purely computational biology papers usually lack a simple method for the reader to run the code and reproduce the key results. Many of these papers could be written up as iPython notebooks instead of as papers. Of course, this is no way to get tenure, so only people who have nothing to prove, like Peter Norvig, do this.

(Some experiments rely on proprietary data, and so cannot provide full reproducibility. However, you can usually at least demonstrate that the method you devised works on simulated data...)

One recent project that took a novel approach was the Encode project, where the authors published a virtual machine image along with the paper to help you reproduce many of the key results. As VMs get simpler and more portable, we might see more of this too.

Research Reproducibility

Research Papers

Conclusion

Comments